D208 P.A. : TASK 2 LOGISTIC REGRESSION FOR PREDICTIVE MODELING

Link to the Panopto Video

https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=8a3aa9f0-8b69-438a-a653-ad0b013b2b44

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part I: Research Question

A. Describe the purpose of this data analysis by doing the following:

1. Summarize one research question that is relevant to a real-world organizational situation captured in the data set you have selected and that you will answer using logistic regression.

“Given that it costs 10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many providers, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecommunications companies need to predict which customers are at high risk of churn.” (D207 D208 D209 Churn Data Consideration and Dictionary.pdf) Beyond prediction, the company may also want to identify the main factors that raise or lower the likelihood of churn, so that it can take the right actions on those factors. Analyzing the data would give the company and its stakeholders a clear picture of these factors and of how reliable each one is.

The research question is: which variables have a relationship with customer 'Churn', and how can these relationships be used to predict a customer's churn probability?

Logistic regression will be used to examine the factors and features that affect a customer's churn probability and to build a prediction model based on these features.

2. Define the objectives or goals of the data analysis. Ensure that your objectives or goals are reasonable within the scope of the data dictionary and are represented in the available data.

The churn probability of each customer may depend on several factors, such as customer satisfaction, service quality, and price, and some of these factors may be more important and more statistically significant than others. The objective of the data analysis is to identify these features and test their significance in order to build a logistic regression model that predicts customer churn probability from the available features. This gives the stakeholders the insight to reduce the factors that increase churn probability and to support those that decrease it.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part II: Method Justification

B. Describe logistic regression methods by doing the following:

1. Summarize the assumptions of a logistic regression model.

APPROPRIATE OUTCOME STRUCTURE Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.

OBSERVATION INDEPENDENCE Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.

ABSENCE OF MULTICOLLINEARITY Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.

LINEARITY OF INDEPENDENT VARIABLES AND LOG ODDS Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.

LARGE SAMPLE SIZE Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
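
The sample-size guideline above can be checked with a few lines of Python. This is a minimal sketch of the rule of thumb only; the function name and its parameters are illustrative and are not part of the original notebook.

```python
import math


def minimum_sample_size(n_predictors, p_least_frequent, cases_per_predictor=10):
    """Rule-of-thumb minimum sample size for logistic regression.

    n_predictors: number of independent variables in the model.
    p_least_frequent: expected proportion of the least frequent outcome.
    cases_per_predictor: required cases of that outcome per predictor.
    """
    if not 0 < p_least_frequent <= 0.5:
        raise ValueError("p_least_frequent must be in the interval (0, 0.5]")
    raw = cases_per_predictor * n_predictors / p_least_frequent
    # Round first so floating-point noise cannot push the ceiling up by one.
    return math.ceil(round(raw, 6))


# The worked example from the text: 5 predictors, least frequent outcome at 10%.
example_size = minimum_sample_size(5, 0.10)
print(example_size)
```

With 5 predictors and a 10% rate for the least frequent outcome, the function reproduces the 10*5 / .10 calculation from the text.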

https://www.statology.org/assumptions-of-logistic-regression/

https://www.lexjansen.com/wuss/2018/130_Final_Paper_PDF.pdf#:~:text=Logistic%20and%20Linear%20Regression%20Assumptions%3A%20Violation%20Recognition%20and,in%20any%20analytic%20plan%2C%20regardless%20of%20plan%20complexity.

2. Describe the benefits of using the tool(s) you have chosen (i.e., Python, R, or both) in support of various phases of the analysis.

Python was selected for this analysis. It is a general-purpose, interpreted, object-oriented language that supports many useful packages for creating linear models. Selected Python libraries include:

3. Explain why logistic regression is an appropriate technique to analyze the research question summarized in Part I.

'Churn' is a binary categorical variable (Yes/No). Its probability may depend on several factors, such as customer satisfaction, service quality, and price, and some of these factors may be more important and more statistically significant than others.

Logistic regression is an appropriate technique for examining the factors and features that affect a customer's churn probability, and the significance of each of them. It models the relationship between multiple explanatory variables and the single binary categorical dependent variable ('Churn'), and thus estimates the probability that churn occurs, which can then be used for classification.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Exploring the Data

Importing Libraries:

Reading the CSV Data:

Data info:

Summary statistics:

Data visualization:

*Function plt_summary(): inspects the data visually using histograms, boxplots, and scatter plots. The function is defined to visually identify statistical parameters and to get a sense of the data, such as outliers, ranges, and dominant values.

Converting Binary categories into numeric:

*Function cat2num(): converts categorical variables into serial numeric values in integer format
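
The notebook's own cat2num() code is not reproduced in this text. As a rough illustration of the idea only, the plain-Python sketch below maps each distinct category to a serial integer. The name cat2num_sketch and the alphabetical ordering of categories are assumptions made for this example, not the author's implementation.

```python
def cat2num_sketch(values):
    """Map each distinct category to a serial integer code (0, 1, 2, ...).

    Categories are ordered alphabetically (an assumption of this sketch),
    so for a Yes/No column 'No' becomes 0 and 'Yes' becomes 1.
    Returns the encoded list and the mapping that was used.
    """
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    encoded = [mapping[value] for value in values]
    return encoded, mapping


# Hypothetical values for a binary column such as 'Churn'.
churn_raw = ["Yes", "No", "No", "Yes"]
churn_codes, churn_mapping = cat2num_sketch(churn_raw)
print(churn_codes, churn_mapping)
```

Returning the mapping alongside the codes keeps the encoding traceable when coefficients are interpreted later.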

Variables Correlation:

*Function plot_corr_ellipses(): imported and modified from the textbook "Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python" (Bruce et al., 2020), via the associated GitHub code repository:

https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%201%20-%20Exploratory%20Data%20Analysis.ipynb

PCA:
Conclusion of Data Exploration:

Part III: Data Preparation

C. Summarize the data preparation process for logistic regression by doing the following:

1. Describe your data preparation goals and the data manipulations that will be used to achieve the goals.

Data preparation goals:

2. Discuss the summary statistics, including the target variable and all predictor variables that you will need to gather from the data set to answer the research question.
3. Explain the steps used to prepare the data for the analysis, including the annotated code.
4. Generate univariate and bivariate visualizations of the distributions of variables in the cleaned data set. Include the target variable in your bivariate visualizations.

Preparing for a Test model

Initializing Test model

List of significant and non significant columns from the test model
Applying variance inflation factor test (VIF)
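
The VIF code from the notebook is not reproduced in this text. As a hedged illustration of what the statistic measures, the sketch below covers only the special case of two predictors, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between them. For more predictors, each VIF comes from regressing one predictor on all the others, which statsmodels provides as variance_inflation_factor in statsmodels.stats.outliers_influence. All variable names and values below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF for the two-predictor case: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictor values, invented for illustration.
predictor_a = [1, 5, 10, 20, 40, 60]
predictor_b = [150, 600, 1100, 2100, 4000, 6100]   # almost a multiple of predictor_a
predictor_c = [200, 120, 180, 150, 210, 140]       # weakly related to predictor_a

vif_ab = vif_two_predictors(predictor_a, predictor_b)
vif_ac = vif_two_predictors(predictor_a, predictor_c)
print(vif_ab, vif_ac)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. The threshold is a convention, not a value taken from this notebook.
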
List of significant and non significant columns from the test model, VIF and modelling iterations
5. Provide a copy of the prepared data set.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part IV: Model Comparison and Analysis

D. Compare an initial and a reduced logistic regression model by doing the following:

1. Construct an initial logistic regression model from all predictors that were identified in Part C2

Initial model

2. Justify a statistically based variable selection procedure and a model evaluation metric to reduce the initial model in a way that aligns with the research question.

Variable selection is based mainly on each variable's p-value (the P>|z| column), with a significance level of 0.05. A p-value below 0.05 means the null hypothesis (that the variable's coefficient is zero) is rejected, so the variable is treated as significant and relevant to the research question (prediction of 'Churn'). A p-value above 0.05 means we fail to reject the null hypothesis, so giving the variable a coefficient of 0 (that is, excluding it from the model) is a defensible decision.

3. Provide a reduced logistic regression model.

Note: The output should include a screenshot of each model.

List of used significant variables , and their relationship with the 'Churn' :
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

E. Analyze the data set using your reduced logistic regression model by doing the following:

  1. Explain your data analysis process by comparing the initial and reduced logistic regression models, including the following elements:

• the logic of the variable selection technique

Variable selection is based mainly on each variable's p-value (the P>|z| column), with a significance level of 0.05. A p-value below 0.05 means the null hypothesis (that the variable's coefficient is zero) is rejected, so the variable is treated as significant and relevant to the research question (prediction of 'Churn'). A p-value above 0.05 means we fail to reject the null hypothesis, so giving the variable a coefficient of 0 (that is, excluding it from the model) is a defensible decision.

The final set of variables was chosen through sequential iterations: running a model, eliminating the non-significant features, rerunning the model, and repeating until the final model was reached. Not all of these runs are included in this notebook, which contains only a test model, an initial model, and a final reduced model.
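
The iterative procedure described above can be sketched as a backward-elimination loop. This is an illustration under stated assumptions, not the notebook's code: it drops one variable per iteration (the one with the largest p-value), whereas the original runs may have removed several at once. The fit_pvalues callable is hypothetical. In practice it could wrap a statsmodels Logit fit and return its pvalues, but here a stand-in with invented p-values is used so the example runs on its own.

```python
def backward_eliminate(features, fit_pvalues, alpha=0.05):
    """Repeatedly refit and drop the least significant feature.

    features: iterable of candidate feature names.
    fit_pvalues: callable that takes a list of feature names, fits a model
        on them, and returns a dict {feature name: p-value}.
    alpha: significance level; elimination stops once every remaining
        feature has a p-value below alpha.
    """
    selected = list(features)
    while selected:
        pvalues = fit_pvalues(selected)
        worst = max(selected, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:
            break  # every remaining feature is significant
        selected.remove(worst)
    return selected


def fake_fit_pvalues(selected):
    """Stand-in for a real model fit; all p-values here are invented."""
    base = {"feature_a": 0.001, "feature_b": 0.02, "feature_c": 0.40, "feature_d": 0.08}
    pvalues = {name: base[name] for name in selected}
    # Pretend feature_d becomes significant once feature_c is removed,
    # which is why the model is refit after every elimination.
    if "feature_c" not in selected and "feature_d" in pvalues:
        pvalues["feature_d"] = 0.03
    return pvalues


final_features = backward_eliminate(
    ["feature_a", "feature_b", "feature_c", "feature_d"], fake_fit_pvalues
)
print(final_features)
```

Refitting after each removal matters because p-values can change when a correlated variable leaves the model, as the stand-in function imitates.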

• the model evaluation metric

2. Provide the output and any calculations of the analysis you performed, including a confusion matrix.

Note: The output should include the predictions from the refined model you used to perform the analysis.
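
As a hedged illustration of how a confusion matrix and accuracy follow from model predictions, the sketch below classifies predicted probabilities at a 0.5 cut-off and counts the four outcomes. The labels and probabilities are invented for the example and are not results from this analysis. The 0.5 threshold is also an assumption.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(actual, predicted))
    return {
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
    }


def accuracy(counts):
    """Share of all cases that were classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Hypothetical true labels (1 = churned) and predicted churn probabilities.
actual_labels = [1, 0, 0, 1, 0, 1, 0, 0]
predicted_probs = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.05]

# Classify as churn when the predicted probability is at least 0.5.
predicted_labels = [1 if prob >= 0.5 else 0 for prob in predicted_probs]

counts = confusion_counts(actual_labels, predicted_labels)
print(counts, accuracy(counts))
```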

3. Provide the code used to support the implementation of the logistic regression models.

Included in this notebook

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part V: Data Summary and Implications

F. Summarize your findings and assumptions by doing the following:

1. Discuss the results of your data analysis, including the following elements:
• a regression equation for the reduced model
• an interpretation of coefficients of the statistically significant variables of the model

The regression equation gives the log of the odds of 'Churn'. The probability of churn is above 0.5 when the log odds are positive and below 0.5 when they are negative. Unstandardized inputs were used so that the log-odds equation is easier to interpret, as follows:

(Note: some coefficient values differ slightly because the models were regenerated.)

etc...
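
To make the link between the log odds and the churn probability concrete, the sketch below applies the logistic function p = 1 / (1 + exp(-log odds)) to a linear equation. The intercept, coefficients, and input values are hypothetical placeholders, not the fitted values of the reduced model.

```python
import math


def log_odds(intercept, coefficients, inputs):
    """Linear predictor: intercept plus the sum of coefficient * input."""
    return intercept + sum(coefficients[name] * inputs[name] for name in coefficients)


def probability_from_log_odds(value):
    """Logistic function: converts log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical coefficients and one hypothetical customer.
intercept = -1.0
coefficients = {"predictor_1": 0.02, "predictor_2": -0.05}
customer = {"predictor_1": 150, "predictor_2": 10}

customer_log_odds = log_odds(intercept, coefficients, customer)
customer_probability = probability_from_log_odds(customer_log_odds)

# exp(coefficient) is the odds ratio for a one-unit increase in that input,
# holding the other inputs fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

print(customer_log_odds, customer_probability, odds_ratios)
```

A positive coefficient gives an odds ratio above 1 (higher odds of churn per unit increase), and a negative coefficient gives an odds ratio below 1.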

• the statistical and practical significance of the model
• the limitations of the data analysis

https://careerfoundry.com/en/blog/data-analytics/what-is-logistic-regression/#:~:text=Here%20are%20a%20few%20takeaways%20to%20summarize%20what,analysis%2C%20and%20different%20types%20of%20logistic%20regression.%20

2. Recommend a course of action based on your results.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part VI: Demonstration

G. Provide a Panopto video recording that includes all of the following elements:

Link to the Panopto Video

https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=8a3aa9f0-8b69-438a-a653-ad0b013b2b44

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

H. List the web sources used to acquire data or segments of third-party code to support the application. Ensure the web sources are reliable.

I. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.

J. Demonstrate professional communication in the content and presentation of your submission.

References:

1- Bruce, P. C., Bruce, A., & Gedeck, P. (2020). Practical statistics for data scientists: 50+ essential concepts using R and Python. Sebastopol, CA: O'Reilly Media, Incorporated.

2- https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%201%20-%20Exploratory%20Data%20Analysis.ipynb

3- (WGU) Predictive Modeling – D208 course materials and Labs

4- General Questions/Answers from https://stackoverflow.com/

5- Documentation of Python packages: pandas, matplotlib, numpy, seaborn, and scipy

6- https://careerfoundry.com/en/blog/data-analytics/what-is-logistic-regression/#:~:text=Here%20are%20a%20few%20takeaways%20to%20summarize%20what,analysis%2C%20and%20different%20types%20of%20logistic%20regression.%20